@victorapm victorapm commented Sep 23, 2025

Add test script for LC's matrix

TODO:

  • cmake-cuda-um-dbg
  • cmake-cuda-um-mixedint
  • cmake-cuda-um-shared
  • cmake-cuda-um-single
  • cmake-cuda-um-without-MPI
  • cmake-cuda-cpu
  • cmake-cuda-bench
  • Autotools (same as CMake but without runs)
  • cmake-cuda-cpu baselines (Not needed since we rely on the .saved files used by tux)

Notes:

  • Internal code changes are needed to get the tests working when building hypre with GPU support on a machine without GPUs (e.g., dane) while setting the execution policy to host.
  • hypre's internal memory tracker was disabled temporarily due to crashes on matrix. To be investigated.
  • The requested residual tolerance for a couple of TEST_ij/solvers.sh runs was relaxed (to 1e-3) so that we can reuse the same .saved files as tux when building with GPU support but setting the execution policy to host.
  • The tests run on a single node of matrix and currently take about 6 hours (to be improved).
  • Example job script for launching on matrix below:
#!/bin/bash
#SBATCH -t 10:00:00
#SBATCH -p pbatch
#SBATCH -J hypre
#SBATCH -o machine-matrix.out
#SBATCH -e machine-matrix.err
#SBATCH -N 1
#SBATCH -G 4
#SBATCH --exclusive

### Shell scripting
date; hostname; git branch --show-current
# Move into the AUTOTEST directory of the checkout; stop if it is missing
cd "$(pwd)/AUTOTEST" || exit 1
echo "Current directory: $(pwd)"

### Launch parallel executable
./test.sh ./machine-matrix.sh ../src

date;
echo 'Done'
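
Once the job finishes, one way to check the outcome is to look for non-empty error files. The sketch below assumes (this is not stated in the PR) that the test driver leaves `*.err` files under the directory it ran in and that an empty file indicates a passing test; `list_nonempty_err` and `check_err_files` are hypothetical helper names.

```bash
# Hypothetical helpers: report *.err files that are not empty.
# Assumption: an empty .err file means the corresponding test passed.

# Print every non-empty *.err file below the given directory (default: .)
list_nonempty_err() {
  find "${1:-.}" -type f -name '*.err' -size +0 -print
}

# Return 0 if all *.err files are empty, 1 otherwise
check_err_files() {
  nonempty=$(list_nonempty_err "$1")
  if [ -n "$nonempty" ]; then
    echo "Non-empty error files:"
    echo "$nonempty"
    return 1
  fi
  echo "All .err files are empty"
}
```

For example, `check_err_files AUTOTEST` could be appended to the job script after the `./test.sh` line, if the assumption above holds for this setup.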

@victorapm victorapm added the "After 3.0" label (Code changes that can wait to be merged after hypre 3.0) Sep 23, 2025
Base automatically changed from hypre-3.0 to master September 26, 2025 14:04
@victorapm victorapm mentioned this pull request Sep 27, 2025
@rfalgout
Contributor

Hi @victorapm. I took a quick look through this, even though I'm technically not a reviewer. :) It generally looks good to me. The only question I have is whether you've diff'ed these saved files with ones from another machine to make sure there is nothing out of the ordinary. A few of the saved files had final residual numbers that were unexpected at first glance (I didn't dig into anything further). Thanks!

@victorapm
Contributor Author

Hi Rob, I was going to ask for your review here once this is complete, so thanks for the early input!

I haven't done that yet, but it sounds like a good idea to me. Could you point out a couple of examples where you saw the unexpected residual numbers? I'll take a closer look at those.

Final Relative Residual Norm = 0.000000e+00

# Output file: cycred.out.3Dx.5
Final Relative Residual Norm = 3.200000e+02

Contributor

Here is one example. The exact zero results in this file also stand out, though they could be just fine.
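
A quick way to surface values like these across saved files is a small filter. The sketch below is only a heuristic: it assumes that exact zeros and relative residual norms of 1 or more are the values worth a second look, and `flag_residuals` is a hypothetical name, not part of hypre's test scripts.

```bash
# Hypothetical helper: print residual-norm lines that look unusual.
# Assumption: "unusual" means exactly zero, or >= 1 (no reduction at all).
# Exit status is 1 if anything was flagged, 0 otherwise.
flag_residuals() {
  awk '
    /Relative Residual Norm/ {
      v = $NF + 0                      # last field is the printed norm
      if (v == 0 || v >= 1) {
        printf "%s:%d: %s\n", FILENAME, FNR, $0
        found = 1
      }
    }
    END { exit (found ? 1 : 0) }
  ' "$@"
}
```

It takes one or more file names, so it could be pointed at a whole set of saved files at once. Flagged lines are not necessarily wrong, as the discussion here shows.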

Contributor Author

In this case, the results on matrix (this file) match the baselines used in tux (cycred.saved). I agree these residual values look weird though...

Contributor Author

Rob, I checked the other saved.matrix files against the corresponding saved.aurora and they all look similar (a couple of iterations different in some cases and residual norms in the range of the requested error tolerance).
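
A comparison like the one described above could be scripted roughly as follows. This is a sketch, not the procedure actually used: `compare_residuals` is a hypothetical helper, it only looks at lines containing "Relative Residual Norm", and the absolute tolerance is an illustrative parameter that would need to match the tolerance requested in each test.

```bash
# Hypothetical helper: compare residual norms of two saved files entry by entry.
# Usage: compare_residuals FILE_A FILE_B [ABS_TOL]
# Exit status is 0 if every pair agrees within ABS_TOL, 1 otherwise.
compare_residuals() {
  awk -v tol="${3:-1e-6}" '
    FNR == 1 { idx++ }                 # idx counts the (non-empty) input files
    /Relative Residual Norm/ {
      if (idx == 1) a[++na] = $NF
      else          b[++nb] = $NF
    }
    END {
      if (na != nb) {
        printf "count mismatch: %d vs %d\n", na, nb
        exit 1
      }
      for (i = 1; i <= na; i++) {
        d = a[i] - b[i]
        if (d < 0) d = -d
        if (d > tol + 0) {
          printf "entry %d differs: %s vs %s\n", i, a[i], b[i]
          bad = 1
        }
      }
      exit (bad ? 1 : 0)
    }
  ' "$1" "$2"
}
```

Iteration counts are ignored on purpose, since a difference of a couple of iterations between machines is expected here.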

@victorapm victorapm changed the title from "[WIP]: Add machine matrix" to "Add machine matrix" Jan 16, 2026
@victorapm victorapm marked this pull request as ready for review January 16, 2026 17:29
# Output file: solvers.out.113
GMRES Iterations = 25
Final GMRES Relative Residual Norm = 8.744056e-09
GMRES Iterations = 10

Contributor

What was the reason for this drop?

Contributor Author

I changed the requested tolerance to 1e-3 so that the residual norms computed on matrix and tux match to the precision printed here. This way, we can reuse the same .saved files from tux when running on matrix with host execution.

@rfalgout rfalgout left a comment

Looks good to me. Thanks!
